@jiangzho
Contributor

What changes were proposed in this pull request?

This PR makes PrometheusServlet publish a snapshot that is compatible with OpenMetrics.

Why are the changes needed?

Adopting OpenMetrics ensures our metrics output is forward-compatible with common observability tools, not only Prometheus, but also OpenTelemetry, Datadog and others.

Does this PR introduce any user-facing change?

This shall still be compatible with Prometheus, while adding more compatibility. No direct user-facing changes otherwise.

How was this patch tested?

CI, with an added unit test.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Aug 29, 2025
@jiangzho jiangzho force-pushed the prometheus branch 3 times, most recently from b020853 to e1e463b Compare September 2, 2025 22:02
@jiangzho
Contributor Author

jiangzho commented Sep 2, 2025

@peter-toth - would you mind helping review this as well? Thanks.

@peter-toth
Contributor

peter-toth commented Sep 5, 2025

This PR is very similar to what you have done in apache/spark-kubernetes-operator#298, so I think it would make sense to split the formatting logic into smaller parts like formatGauge(), formatHistogram()...

But my biggest concern is that apache/spark-kubernetes-operator#298 was a completely new feature, whereas here you change an existing output format. Shouldn't we offer a config to keep the old format as well?

cc @dongjoon-hyun as you seem to have worked a lot on this PrometheusServlet.

Contributor Author

Thanks a lot for pointing this out!

I was considering that even if this is a Number, it may not be formatted correctly with a simple getValue if we are dealing with something more complex, like BigDecimal or AtomicLong. By calling doubleValue we avoid the possibility that toString gives us a string instead of a number.

spark-kubernetes-operator is relatively new and we are sure it has no such gauges, but IMO we need to take this into consideration here as well. I'll fix that.
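
A minimal sketch of the doubleValue idea, written in Java since the types mentioned are Java classes. GaugeValueSketch and formatGaugeValue are hypothetical names, not code from this PR, and skipping non-numeric gauges is an assumption made for illustration:

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not code from the PR.
final class GaugeValueSketch {
    // Returns the sample text for any Number (BigDecimal, AtomicLong, Integer, ...)
    // and empty for everything else, so a non-numeric gauge is skipped
    // (an assumption) instead of being written out via toString().
    static Optional<String> formatGaugeValue(Object value) {
        if (value instanceof Number) {
            double d = ((Number) value).doubleValue();
            return Optional.of(Double.toString(d));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(formatGaugeValue(new BigDecimal("5.123")).orElse("<skipped>")); // 5.123
        System.out.println(formatGaugeValue(new AtomicLong(42)).orElse("<skipped>"));      // 42.0
        System.out.println(formatGaugeValue("not a number").orElse("<skipped>"));          // <skipped>
    }
}
```

One trade-off of this approach is that doubleValue can lose precision for integer values above 2^53.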

Contributor

Does it cause any issues if we omit max, min and mean values?

Contributor Author

Thanks for the catch!

Although OpenMetrics considers these optional, we should not drop them, since they are present in the current output. I'll add them.

Contributor

ditto

@dongjoon-hyun
Member

Thank you for pinging me, @peter-toth .

Member

@dongjoon-hyun dongjoon-hyun left a comment

  1. Not only format-wise but also value-wise, does this PR emit the same set of metric values as Apache Spark 4.0.1?

  2. Is there any chance of discontinuity in the graphs in upper layers like Grafana?

  3. May I ask how you verified this, @jiangzho ?

This shall still be compatible with Prometheus, while adding more compatibility.

@jiangzho
Contributor Author

Not only format-wise but also value-wise, does this PR emit the same set of metric values as Apache Spark 4.0.1?

Thanks for pointing that out!

Yes, there may be value differences.

For example, timer handling: Codahale gives us nanoseconds, but it's a Prometheus best practice to use seconds, so we added the conversion.
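
As a rough illustration of that unit conversion (TimerUnitSketch and nanosToSeconds are hypothetical names, not taken from the PR), a duration in nanoseconds can be turned into seconds like this:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not code from the PR.
final class TimerUnitSketch {
    // Number of nanoseconds in one second, as a double to force floating-point division.
    private static final double NANOS_PER_SECOND = (double) TimeUnit.SECONDS.toNanos(1);

    // Converts a duration expressed in nanoseconds to seconds.
    static double nanosToSeconds(double nanos) {
        return nanos / NANOS_PER_SECOND;
    }

    public static void main(String[] args) {
        System.out.println(nanosToSeconds(1_500_000_000L)); // 1.5
    }
}
```

Any dashboard that read the old values as nanoseconds would see numbers smaller by a factor of 10^9 after such a change, which is the kind of value difference in question.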

Would you suggest a list of possible value changes like this?

Is there any chance of discontinuity in the graphs in upper layers like Grafana?

Since Grafana relies on Prometheus and does not query our pods directly, we'll need to ensure that:

  • This format is Prometheus compatible
    • OpenMetrics promises to be largely backward compatible
    • For the differences, we adopted a few workarounds to keep the changes minimal
      • For example, since OpenMetrics does not allow precomputed percentiles in histograms, we switched to summary
  • Prometheus is able to scrape and parse this format (we can test)
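
To illustrate the summary workaround, here is a minimal sketch that renders precomputed quantiles as a summary in the text exposition shape. SummarySketch and renderSummary are hypothetical names, not the PR's code, and the sketch does not try to cover everything a complete scrape response may require:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not code from the PR.
final class SummarySketch {
    // Renders HELP/TYPE metadata, one sample per precomputed quantile,
    // and the _count and _sum samples for a single summary metric.
    static String renderSummary(String name, String help,
                                Map<Double, Double> quantiles, long count, double sum) {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP ").append(name).append(' ').append(help).append('\n');
        sb.append("# TYPE ").append(name).append(" summary\n");
        for (Map.Entry<Double, Double> q : quantiles.entrySet()) {
            sb.append(name).append("{quantile=\"").append(q.getKey()).append("\"} ")
              .append(q.getValue()).append('\n');
        }
        sb.append(name).append("_count ").append(count).append('\n');
        sb.append(name).append("_sum ").append(sum).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the quantiles in insertion order.
        Map<Double, Double> quantiles = new LinkedHashMap<>();
        quantiles.put(0.5, 75.0);
        quantiles.put(0.75, 150.0);
        quantiles.put(0.999, 150.0);
        System.out.print(renderSummary("metrics_test_hist", "Histogram metric", quantiles, 3L, 250.0));
    }
}
```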

May I ask how you verified this, @jiangzho ?

For this scope, we printed a sample snapshot (consisting of the metrics added in the unit tests), which looks like this:

# HELP metrics_counter1_total Counter metric
# TYPE metrics_counter1_total counter
metrics_counter1_total 42
# HELP metrics_gauge1 Gauge metric
# TYPE metrics_gauge1 gauge
metrics_gauge1 5.123
# HELP metrics_test_timer_duration_seconds Timer summary metric
# TYPE metrics_test_timer_duration_seconds summary
metrics_test_timer_duration_seconds{quantile="0.5"} 1.5
metrics_test_timer_duration_seconds{quantile="0.75"} 1.5
metrics_test_timer_duration_seconds{quantile="0.95"} 1.5
metrics_test_timer_duration_seconds{quantile="0.98"} 1.5
metrics_test_timer_duration_seconds{quantile="0.99"} 1.5
metrics_test_timer_duration_seconds{quantile="0.999"} 1.5
metrics_test_timer_duration_seconds_count 2
metrics_test_timer_duration_seconds_sum 2.0
# HELP metrics_test_timer_m1_rate Timer rate 1-min moving avg metric
# TYPE metrics_test_timer_m1_rate gauge
metrics_test_timer_m1_rate 0.0
# HELP metrics_test_timer_m5_rate Timer rate 5-min moving avg metric
# TYPE metrics_test_timer_m5_rate gauge
metrics_test_timer_m5_rate 0.0
# HELP metrics_test_timer_m15_rate Timer rate 15-min moving avg metric
# TYPE metrics_test_timer_m15_rate gauge
metrics_test_timer_m15_rate 0.0
# HELP metrics_test_hist Histogram metric
# TYPE metrics_test_hist summary
metrics_test_hist{quantile="0.5"} 75.0
metrics_test_hist{quantile="0.75"} 150.0
metrics_test_hist{quantile="0.95"} 150.0
metrics_test_hist{quantile="0.98"} 150.0
metrics_test_hist{quantile="0.99"} 150.0
metrics_test_hist{quantile="0.999"} 150.0
metrics_test_hist_count 3
metrics_test_hist_sum 250.0
# HELP metrics_test_hist_min Minimum value
# TYPE metrics_test_hist_min gauge
metrics_test_hist_min 25
# HELP metrics_test_hist_max Maximal value
# TYPE metrics_test_hist_max gauge
metrics_test_hist_max 150
# HELP metrics_test_hist_mean Mean value
# TYPE metrics_test_hist_mean gauge
metrics_test_hist_mean 75.0
# HELP metrics_test_hist_stddev Standard deviation value
# TYPE metrics_test_hist_stddev gauge
metrics_test_hist_stddev 51.370116691408136
# HELP metrics_test_meter_count_cumulative Meter counts metric
# TYPE metrics_test_meter_count_cumulative gauge
metrics_test_meter_count_cumulative 5
# HELP metrics_test_meter_mean_rate total counts metric
# TYPE metrics_test_meter_mean_rate gauge
metrics_test_meter_mean_rate 32432.36230840582
# HELP metrics_test_meter_m1_rate 1-min moving avg metric
# TYPE metrics_test_meter_m1_rate gauge
metrics_test_meter_m1_rate 0.0
# HELP metrics_test_meter_m5_rate 5-min moving avg metric
# TYPE metrics_test_meter_m5_rate gauge
metrics_test_meter_m5_rate 0.0
# HELP metrics_test_meter_m15_rate 15-min moving avg metric
# TYPE metrics_test_meter_m15_rate gauge
metrics_test_meter_m15_rate 0.0

We then ran a validation with promtool check metrics to confirm that it passes.

Would you suggest adding an integration test for this?

@dongjoon-hyun
Member

In that case, I'd recommend expanding your contribution into a new OpenMetricsServlet.scala instead of changing the legacy PrometheusServlet.scala.

Yes, there may be value differences.

In this way, Apache Spark can support OpenMetrics cleanly from Apache Spark 4.1.0.

@peter-toth
Contributor

In that case, I'd recommend expanding your contribution into a new OpenMetricsServlet.scala instead of changing the legacy PrometheusServlet.scala.

+1 to the new servlet idea.

@dongjoon-hyun
Member

Gentle ping, @jiangzho . FYI, Apache Spark 4.1.0-preview2 vote will start next week.
